On Linking Heterogeneous Dataset Collections

Authors

  • Mayank Kejriwal
  • Daniel P. Miranker
Abstract

Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods explicitly assume that two RDF datasets are provided as input and need to be linked. In this paper, we assume instead that two heterogeneous dataset collections, comprising arbitrary numbers of RDF and tabular datasets, are provided as input. We show that data model heterogeneity can be addressed by representing RDF datasets as property tables. We also propose an unsupervised technique called dataset mapping that maps datasets from one collection to the other, and is shown to be compatible with existing clustering methods. Dataset mapping is empirically evaluated on three real-world test collections ranging over government and constitutional domains, and is shown to improve two established baselines.
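
As a rough illustration of two ideas named in the abstract, the sketch below pivots RDF-style triples into a property table (one row per subject) and then applies simple token-based blocking, so that only entities sharing a block are compared. This is not the authors' implementation: the function names, the sample triples, and the choice of token blocking are all assumptions made for illustration, and for brevity the sketch blocks within a single table rather than across two collections.

```python
from collections import defaultdict
from itertools import combinations


def to_property_table(triples):
    """Pivot (subject, property, object) triples into one row per subject."""
    table = defaultdict(lambda: defaultdict(set))
    for subject, prop, obj in triples:
        table[subject][prop].add(obj)
    # Convert nested defaultdicts/sets into plain dicts with sorted value lists.
    return {s: {p: sorted(v) for p, v in props.items()} for s, props in table.items()}


def token_blocks(table):
    """Map each entity to one block per lowercase token in its values.

    An entity usually lands in several blocks, which matches the
    'one-to-many mapping from entities to blocks' described above.
    """
    blocks = defaultdict(set)
    for entity, props in table.items():
        for values in props.values():
            for value in values:
                for token in str(value).lower().split():
                    blocks[token].add(entity)
    return blocks


def candidate_pairs(blocks):
    """Collect entity pairs that share at least one block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical sample data, invented for this example.
triples = [
    ("ex:a1", "name", "Texas Senate"),
    ("ex:a2", "name", "Senate of Texas"),
    ("ex:a3", "name", "Ohio House"),
    ("ex:a1", "state", "TX"),
]
table = to_property_table(triples)
pairs = candidate_pairs(token_blocks(table))
print(sorted(pairs))
```

With three entities, an exhaustive comparison would evaluate three pairs; here only the pair sharing the tokens "texas" and "senate" survives blocking. A link specification would then be evaluated on the surviving pairs only.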

Similar resources

A two-step blocking scheme learner for scalable link discovery

A two-step procedure for learning a link-discovery blocking scheme is presented. Link discovery is the problem of linking entities between two or more datasets. Identifying owl:sameAs links is an important, special case. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks, and limiting the evaluation of l...

Entity Linking to One Thousand Knowledge Bases

We address the task of entity linking to multiple knowledge bases (KB). In particular, we investigate the use of over one thousand domain-specific KBs derived from Wikia.com collections in conjunction with the Wikipedia collection as a background-knowledge repository. Our system employs a two-step approach: for each document, a supervised model with a large set of features detects whether there...

Exploration of Audiovisual Heritage Using Audio Indexing Technology

This paper discusses audio indexing tools that have been implemented for the disclosure of Dutch audiovisual cultural heritage collections. It explains the role of language models and their adaptation to historical settings and the adaptation of acoustic models for homogeneous audio collections. In addition to the benefits of cross-media linking, the requirements for successful tuning and impro...

SPE-174907-MS Rapid Data Integration and Analysis for Upstream Oil and Gas Applications

The increasingly large number of sensors and instruments in the oil and gas industry, along with novel means of communication in the enterprise has led to a corresponding increase in the volume of data that is recorded in various information repositories. The variety of information sources is also expanding: from traditional relational databases to time series data, social network communication...

Modularity based community detection in heterogeneous networks

Heterogeneous networks are networks consisting of different types of nodes and multiple types of edges linking such nodes. While community detection has been extensively developed as a useful technique for analyzing networks that contain only one type of nodes, very few community detection techniques have been developed for heterogeneous networks. In this paper, we propose a modularity based co...

Publication date: 2014